Add numpy to the mypy pre-commit environment #20282

Merged
rapids-bot[bot] merged 28 commits into rapidsai:branch-25.12 from vyasr:fix/typing_numpy
Oct 16, 2025

vyasr (Contributor) commented Oct 16, 2025

Description

Contributes to #11661

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

vyasr and others added 25 commits October 16, 2025 00:45
Added numpy to the mypy additional_dependencies to enable numpy type
stubs for improved type checking.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Fixed mypy error: Need type annotation for "np_dtype" at line 22.
Added explicit type annotation np.dtype[np.object_] to class attribute.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
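
As a minimal sketch of the kind of fix the `np_dtype` commits describe: once numpy's stubs are visible, mypy may ask for an explicit annotation on a class attribute. `ExampleDtype` below is a hypothetical stand-in, not a cuDF class; only the `np.dtype[np.object_]` annotation is taken from the commit message.

```python
from __future__ import annotations

import numpy as np


class ExampleDtype:
    """Hypothetical stand-in for a dtype class that exposes ``np_dtype``."""

    # The explicit annotation tells mypy the attribute type directly instead
    # of leaving it to inference from the numpy stubs.
    np_dtype: np.dtype[np.object_] = np.dtype("object")
```

The annotation only affects static checking; at runtime the attribute is still an ordinary object dtype.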
Fixed mypy error: Need type annotation for "np_dtype" at line 33.
Added explicit type annotation np.dtype[np.object_] to class attribute.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Fixed mypy error: Need type annotation for "np_dtype" at line 44.
Added explicit type annotation np.dtype[np.object_] to class attribute.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Fixed mypy error: Need type annotation for "min_date" at line 124.
Added explicit type annotation np.datetime64 for local variable.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Fixed mypy error: Need type annotation for "np_dtypes_to_pandas_dtypes" at line 23.
Added explicit type annotation dict[np.dtype[Any], pd.core.dtypes.base.ExtensionDtype]
and imported Any from typing.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Fixed mypy error: Need type annotation for "dtype" at line 205.
Added explicit type annotation np.dtype[Any] for local variable in loop.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Fixed mypy error: Need type annotation for "dtype" at line 219.
Added explicit type annotation np.dtype[Any] for local variable in loop.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Fixed mypy error: Need type annotation for "SUPPORTED_NUMPY_TO_PYLIBCUDF_TYPES" at line 763.
Added explicit type annotation dict[np.dtype[Any], plc.types.TypeId].

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Fixed mypy error: Need type annotation for "_UNDERLYING_DTYPE" at line 47.
Added explicit type annotation np.dtype[np.int64] to class attribute.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Fixed 4 mypy errors in dtypes.py by:

1. Converting is_pandas_nullable_extension_dtype to use TypeGuard[pd.core.dtypes.base.ExtensionDtype]
   - This resolved .na_value access errors on lines 255 and 262

2. Filtering dtypes to create cat_dtypes list with explicit isinstance checks
   - Ensures mypy knows all items are cudf.CategoricalDtype

3. Adding None check when filtering categorical dtypes
   - Filters out dtypes where _categories is None

4. Using explicit loop with assertions instead of list comprehensions
   - Helps mypy understand _categories is not None after filtering
   - This resolved ._categories access errors on lines 297 and 300

The TypeGuard pattern tells mypy to narrow the dtype type when the function
returns True, making attribute access type-safe without runtime overhead.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Added type: ignore[call-overload] comment with detailed explanation.

The issue is that numpy's type stubs for datetime64/timedelta64 constructors
only accept literal strings for the time unit parameter (like "ns", "us", etc.)
to enable compile-time validation. However, we're passing a variable string
(to_unit) which contains a time unit that we know is valid at runtime.

This is one of 20 errors on this line from numpy's overly restrictive stubs.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
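
A small illustration of the situation this commit message describes, assuming a numpy version whose stubs behave as stated above. The helper name `one_step` is hypothetical; `to_unit` mirrors the variable named in the message.

```python
import numpy as np


def one_step(to_unit: str) -> np.timedelta64:
    # to_unit is only known at runtime, so it is typed as a plain str rather
    # than a literal such as "ns". The ignore comment silences the resulting
    # overload error without changing runtime behavior.
    return np.timedelta64(1, to_unit)  # type: ignore[call-overload]
```

If the installed stubs accept a plain `str` here, mypy's `warn_unused_ignores` option would flag the comment as unnecessary, so the ignore is worth revisiting when numpy is upgraded.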
Added type: ignore[call-overload] comment with detailed explanation.

The issue is that numpy's type stubs for timedelta64 constructors only
accept literal strings for the time unit parameter to enable compile-time
validation. However, we're passing self.time_unit which is a variable
containing a valid time unit at runtime.

This is one of 30 errors on lines 313-323 from numpy's overly restrictive stubs.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Added type: ignore[call-overload] comment (reusing explanation from max_dist).

The issue is that numpy's type stubs for timedelta64 constructors only
accept literal strings for the time unit parameter to enable compile-time
validation. However, we're passing self.time_unit which is a variable
containing a valid time unit at runtime.

This is one of 30 errors on lines 313-323 from numpy's overly restrictive stubs.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Added type: ignore[call-overload] comment (reusing explanation from max_dist).

The issue is that numpy's type stubs for timedelta64 constructors only
accept literal strings for the time unit parameter to enable compile-time
validation. However, we're passing to_res which is a variable containing
a valid time unit at runtime.

This is one of 30 errors on lines 313-323 from numpy's overly restrictive stubs.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Added assertion to narrow type for mypy before the constructor call.

The issue is that col_dtype has a union type (DtypeObj) which includes
many types, but at this point in the code we know it's one of the decimal
types because of the check on lines 2224-2228. The assertion tells mypy
that col_dtype is specifically a decimal dtype, so type(col_dtype) will
be a decimal dtype constructor that accepts (precision, scale) arguments.

This fixes 59 errors on this single line from numpy's strict type stubs.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
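
The assertion-based narrowing described above can be sketched with stand-in types. Everything here (`DecimalDtype`, `StringDtype`, `is_decimal`, `with_precision`) is hypothetical and only illustrates the pattern, not the cuDF code at the lines cited.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class DecimalDtype:
    precision: int
    scale: int


@dataclass
class StringDtype:
    pass


# Stand-in for a wide union alias such as DtypeObj.
DtypeObj = Union[DecimalDtype, StringDtype]


def is_decimal(dtype: DtypeObj) -> bool:
    # Returns a plain bool, so mypy cannot narrow the caller's variable.
    return isinstance(dtype, DecimalDtype)


def with_precision(col_dtype: DtypeObj, precision: int) -> DtypeObj:
    if is_decimal(col_dtype):
        # The assert restates what the check above already guarantees, in a
        # form mypy understands.
        assert isinstance(col_dtype, DecimalDtype)
        return type(col_dtype)(precision, col_dtype.scale)
    return col_dtype
```

Unlike a `type: ignore`, the assertion keeps the constructor call fully type-checked and fails loudly if the earlier check is ever changed.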
Changed comment blocks from starting with "# type: ignore[call-overload]:"
to "# call-overload must be ignored because" to avoid mypy treating them
as malformed type: ignore directives.

Fixed 2 errors at lines 165 and 317.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Group 2: Changed _NP_SCALAR from instance annotation to ClassVar to allow
subclasses to assign specific numpy scalar types (datetime64 or timedelta64).

Group 3: Added type: ignore[call-overload] comments for numpy constructor
calls with variable time unit strings, which numpy stubs don't support.

Fixed 7 total errors (3 from Group 2 + 4 from Group 3).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
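
A rough sketch of the `ClassVar` change from Group 2, using hypothetical class names. Only `_NP_SCALAR` and the two numpy scalar types come from the commit message.

```python
from __future__ import annotations

from typing import ClassVar

import numpy as np


class TemporalBase:
    """Hypothetical base class shared by datetime and timedelta columns."""

    # Declared as a class-level attribute so each subclass can assign its
    # own concrete numpy scalar type.
    _NP_SCALAR: ClassVar[type[np.datetime64] | type[np.timedelta64]]


class DatetimeLike(TemporalBase):
    _NP_SCALAR = np.datetime64


class TimedeltaLike(TemporalBase):
    _NP_SCALAR = np.timedelta64
```

The attribute is read through the class, for example `TimedeltaLike._NP_SCALAR(5, "s")`, so no instance is needed to build a scalar.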
Added type: ignore[arg-type] comments for as_interval_column and
as_decimal_column calls where mypy cannot narrow the dtype type from
the is_dtype_obj_* function checks (which are not TypeGuards).

Fixed 2 errors at lines 1734 and 1742.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Fixed 8 mypy errors related to _get_nan_for_dtype:
1. Changed return type from DtypeObj to np.generic in dtypes.py
   - Function returns numpy scalar values like np.float64('nan') or np.datetime64('NaT')
2. Added type: ignore[return-value] comments in numerical_base.py at 7 locations
   - kurtosis() lines 93, 98
   - quantile() line 184
   - median() line 228
   - cov() line 247
   - corr() lines 255, 261

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
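
To illustrate why `np.generic` is the more accurate return type, here is a simplified, hypothetical helper in the spirit of `_get_nan_for_dtype`. It is not the cuDF implementation; it only shows that the returned values are numpy scalars rather than dtype objects.

```python
import numpy as np


def get_nan_for_dtype(dtype: np.dtype) -> np.generic:
    """Return a missing-value scalar suited to ``dtype`` (illustrative only)."""
    if dtype.kind in "mM":
        # Datetime and timedelta dtypes use NaT in the dtype's own unit.
        time_unit, _ = np.datetime_data(dtype)
        return dtype.type("NaT", time_unit)
    if dtype.kind == "f":
        return dtype.type("nan")
    # Assumed fallback for all other dtypes in this sketch.
    return np.float64("nan")
```

Every branch returns an instance of a `np.generic` subclass, which is what callers such as statistical reductions would pass along as a scalar result.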
Added explicit type annotations for 12 variables across multiple files
where mypy could not infer types after adding numpy stubs:

- dtypes.py line 694: Fixed fields dict type annotation (str not bytes)
- decimal.py lines 387, 512: Added data_buf_128 ndarray annotations
- index.py lines 2882, 5175: Added dtype and child_type annotations
- frame.py line 559: Updated to_array parameter to allow None
- groupby.py line 1558: Added high ndarray annotation
- csv.py line 38: Added _CSV_HEX_TYPE_MAP dict annotation
- numeric.py line 180: Added downcast_dtype annotation
- datetimes.py line 852: Added dtype annotation
- queryutils.py line 26: Added SUPPORTED_QUERY_TYPES set annotation
- fast_slow_proxy.py line 1166: Added transformed list annotation

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Line 1167: Added type: ignore for list comprehension loop variable
- Line 1174: Changed dtype=object to dtype=np.object_ for np.empty()
- Line 1389: Fixed NUMPY_TYPES annotation from set[str] to set[type[np.generic]]

These fixes address strict numpy stub type checking requirements.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
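
The two annotation-related fixes listed above might look roughly like this. The contents of `NUMPY_TYPES` and the helper `make_object_buffer` are made up for illustration; only the annotation shape and the `dtype=np.object_` spelling come from the commit message.

```python
from __future__ import annotations

import numpy as np

# A set of numpy scalar classes (not strings), matching the corrected
# annotation. The members here are illustrative.
NUMPY_TYPES: set[type[np.generic]] = {np.int64, np.float64, np.bool_}


def make_object_buffer(n: int) -> np.ndarray:
    # np.object_ is numpy's own scalar type for Python objects and yields
    # the same array dtype as the builtin `object`.
    return np.empty(n, dtype=np.object_)
```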
Updated type signatures across multiple files to accept both int and numpy
unsigned integer types for seed parameters, resolving type inconsistencies
between public APIs, column methods, and pylibcudf stubs.

Changes:
1. Column-level methods (string.py, lists.py): Updated minhash, minhash64,
   hash_character_ngrams, minhash_ngrams, and minhash64_ngrams to accept
   int | np.uint32 or int | np.uint64 for seed parameters
2. Added runtime validation to convert int to appropriate numpy unsigned
   integer type with bounds checking before calling pylibcudf
3. Accessor-level methods (accessors/string.py): Updated method signatures
   to accept int | np.uint32 or int | np.uint64 for consistency
4. pylibcudf stubs: Updated minhash.pyi and generate_ngrams.pyi to accept
   int | np.unsignedinteger[Any] for seed parameters

This allows public APIs to accept convenient int literals while maintaining
type safety and proper conversion to numpy unsigned integers at runtime.

Progress: 19 of 200 original mypy errors remain (90.5% complete)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
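
The seed handling in items 1 and 2 can be sketched as follows. `to_uint32_seed` is a hypothetical helper name, and the error message is illustrative; the actual cuDF validation may differ.

```python
from __future__ import annotations

import numpy as np


def to_uint32_seed(seed: int | np.uint32) -> np.uint32:
    """Convert a seed to np.uint32, rejecting values that do not fit."""
    if isinstance(seed, np.uint32):
        return seed
    bounds = np.iinfo(np.uint32)
    if not bounds.min <= seed <= bounds.max:
        raise ValueError(
            f"seed must be in [{bounds.min}, {bounds.max}], got {seed}"
        )
    return np.uint32(seed)
```

Checking the bounds before conversion avoids relying on numpy's own overflow behavior, which has differed between numpy versions.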
Fixed remaining mypy type errors across core modules:

Column modules:
- struct.py: Added type narrowing for dtype assignment in to_arrow()
- numerical.py: Added union type annotations for finfo bounds
- numerical_base.py: Corrected placement of type: ignore for return value
- decimal.py: Added None check for data_buf, fixed dtype assignment in to_arrow()

Core modules:
- series.py: Added type: ignore for is_dict_like() check
- index.py: Added type: ignore for as_column return type in RangeIndex
- frame.py: Widened dtype parameter type to Any in to_array helper
- groupby.py: Inlined to_take construction to avoid assignment conflicts

Tools modules:
- numeric.py: Added type: ignore for numpy typecodes string access
- datetimes.py: Added type: ignore for np.datetime64 with non-literal unit

Pandas modules:
- _wrappers/numpy.py: Added type: ignore for conditional flagsobj import
- fast_slow_proxy.py: Corrected placement of type annotations for list comprehension and np.empty call

Stub updates:
- quantiles.pyi: Changed parameter type from Sequence[float] to Iterable[float] to accept numpy arrays

All fixes preserve runtime behavior while satisfying mypy's strict type checking.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
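
The `quantiles.pyi` stub change rests on the fact that a numpy array is iterable but is not a `collections.abc.Sequence`. A quick check, with a hypothetical `normalize_quantiles` helper standing in for the stubbed function:

```python
from __future__ import annotations

from collections.abc import Iterable, Sequence

import numpy as np


def normalize_quantiles(q: Iterable[float]) -> list[float]:
    # Accepts lists, tuples, generators, and numpy arrays alike.
    return [float(x) for x in q]


arr = np.array([0.25, 0.5])
is_sequence = isinstance(arr, Sequence)  # ndarray is not registered as a Sequence
is_iterable = isinstance(arr, Iterable)  # ndarray defines __iter__
```

Annotating the parameter as `Iterable[float]` therefore admits arrays without widening the type all the way to `Any`.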
vyasr self-assigned this Oct 16, 2025
vyasr requested review from a team as code owners October 16, 2025 00:47
vyasr requested a review from AyodeAwe October 16, 2025 00:47
vyasr added the improvement (Improvement / enhancement to an existing function) label Oct 16, 2025
vyasr added the non-breaking (Non-breaking change) label Oct 16, 2025
vyasr requested review from Matt711 and mroeschke October 16, 2025 00:47
github-actions bot added the Python (Affects Python cuDF API), cudf.pandas (Issues specific to cudf.pandas), and pylibcudf (Issues specific to the pylibcudf package) labels Oct 16, 2025
GPUtester moved this to In Progress in cuDF Python Oct 16, 2025
bdice mentioned this pull request Oct 16, 2025
3 tasks
Comment thread python/cudf/cudf/core/column/column.py Outdated
Comment thread python/cudf/cudf/core/frame.py Outdated
rapids-bot Bot pushed a commit that referenced this pull request Oct 16, 2025
Adding more packages to the mypy environment for validation pushes us over the maximum size that pre-commit.ci allows for its environments.

The mypy checks will still run as part of the `check-style` job on our CI system; we just won't have pre-commit.ci for this check.

xref:
- https://results.pre-commit.ci/run/github/90506918/1760575643.VSn5D0uuT16kJEDbQWeS5g
> build of https://github.com/pre-commit/mirrors-mypy:types-cachetools,pyarrow-stubs,numpy@v1.13.0 for python@python3 exceeds tier max size 250MiB: 254MiB 
- #20282

Authors:
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - Kyle Edwards (https://github.com/KyleFromNVIDIA)

URL: #20286
Comment thread python/cudf/cudf/core/column/column.py
vyasr (Contributor Author) commented Oct 16, 2025

/merge

rapids-bot[bot] merged commit e534472 into rapidsai:branch-25.12 Oct 16, 2025
137 checks passed
github-project-automation bot moved this from In Progress to Done in cuDF Python Oct 16, 2025
vyasr deleted the fix/typing_numpy branch October 16, 2025 22:15
Labels

  • cudf.pandas: Issues specific to cudf.pandas
  • improvement: Improvement / enhancement to an existing function
  • non-breaking: Non-breaking change
  • pylibcudf: Issues specific to the pylibcudf package
  • Python: Affects Python cuDF API.

5 participants